-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling arm64: topology and online information #2744
Handling arm64: topology and online information #2744
Conversation
e45f35d
to
ca5682e
Compare
@bobbypage it seems that this fix could make some folks in Kubernetes community merrier: kubernetes/kubernetes#95039. Might be worth being reviewed. |
Thanks, LGTM for changes here. Only question/concern: the changes here basically make the assumption that arm64 does need to check I don't know too much about cpu hotplug / arm64 support, but just wondering based on #2744 (comment), is something weird with Raspberry Pi 4 specifically? Just thinking that maybe there's some other ARM64 machines that do support cpu hotplug and might need the online check? LMK what you think, and if this is actually a real concern. |
See also related in #2677 regarding if kernel complied with |
I looked at Graviton2 EC2 instance yesterday and it behaves as expected: online file is there. Perhaps better way to fix it would be to assume that lack of online file means lack of hotplug capabilities, irrespective of architecture. |
Yeah, that's kinda what I was thinking, perhaps that's a better fix? WDYT? |
It will require pretty extensive testing on various architectures and configurations. We should at least test:
I will try to conduct arm testing on Graviton2 and perhaps some other arm that I will be able to get hold of. X86 should not be an issue. My original intention was to avoid changes to x86 code - this is why I decided to limit changes to arm only. |
4152655
to
2fa1a16
Compare
5a282b6
to
0e4529d
Compare
utils/sysfs/sysfs.go
Outdated
// getOnlineStateID looks for line similar to `206: online` from /sys/devices/system/cpu/hotplug/states | ||
// and extracts number from the beginning. | ||
// See: https://www.kernel.org/doc/html/latest/core-api/cpu_hotplug.html#testing-of-hotplug-states | ||
func getOnlineStateID(cpuPath string) (int, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it would make sense to cache result of this function: it can't change when system is running, I believe. Do you think that using package-level variable to cache the output makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it can't change, I guess doesn't hurt to cache it. Re using package level variable, I think it's fine, e.g. see this existing example: https://github.com/google/cadvisor/blob/master/machine/machine.go#L53
utils/sysfs/sysfs.go
Outdated
// isCPUOnline verifies if a CPU in online. Takes path to /sys/devices/system/cpu/cpu*/ directory as a parameter. | ||
func isCPUOnline(cpuPath string) bool { | ||
if isx86() && isZeroCPU(cpuPath) { | ||
// It *is* possible to enable CPU hotplug for CPU0 but it requires setting a boot flag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is safe, but I can't promise it. Do you think we should check kernel boot params?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any reason we need special logic for cpu0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or actually just looking at my 32 core ubuntu GCE VM:
root@ubuntu-big:~# find /sys/devices/system/cpu/ | grep "online"
/sys/devices/system/cpu/cpu21/online
/sys/devices/system/cpu/cpu11/online
/sys/devices/system/cpu/cpu9/online
/sys/devices/system/cpu/cpu7/online
/sys/devices/system/cpu/cpu28/online
/sys/devices/system/cpu/cpu18/online
/sys/devices/system/cpu/cpu5/online
/sys/devices/system/cpu/cpu26/online
/sys/devices/system/cpu/cpu16/online
/sys/devices/system/cpu/cpu3/online
/sys/devices/system/cpu/cpu24/online
/sys/devices/system/cpu/online
/sys/devices/system/cpu/cpu14/online
/sys/devices/system/cpu/cpu1/online
/sys/devices/system/cpu/cpu22/online
/sys/devices/system/cpu/cpu12/online
/sys/devices/system/cpu/cpu30/online
/sys/devices/system/cpu/cpu20/online
/sys/devices/system/cpu/cpu10/online
/sys/devices/system/cpu/cpu8/online
/sys/devices/system/cpu/cpu29/online
/sys/devices/system/cpu/cpu19/online
/sys/devices/system/cpu/cpu6/online
/sys/devices/system/cpu/cpu27/online
/sys/devices/system/cpu/cpu17/online
/sys/devices/system/cpu/cpu4/online
/sys/devices/system/cpu/cpu25/online
/sys/devices/system/cpu/cpu15/online
/sys/devices/system/cpu/cpu2/online
/sys/devices/system/cpu/cpu23/online
/sys/devices/system/cpu/cpu13/online
/sys/devices/system/cpu/cpu31/online
cpu0 online doesn't exist. is it why this case is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general you should not put cpu0 offline on x86 because it is used to boot the operating system. If it's unavailable then you might not be able to start your system. I am extremely surprised by what you shared about your VM: can you share your kernel boot params? Perhaps it's a case of disabled cpu0 and I bet you haven't altered your configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a standard ubuntu GCE VM, created as follows
$ gcloud beta compute instances create ubuntu-big --machine-type=e2-standard-32 --image-family=ubuntu-2004-lts --image-project=ubuntu-os-cloud --boot-disk-size=256GB --boot-disk-type=pd-ssd --zone=us-central1-c
$ uname -a
Linux ubuntu-big 5.4.0-1029-gcp #31-Ubuntu SMP Wed Oct 21 19:38:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Here's KConfig from
sudo cat /boot/config-$(uname -r)
https://gist.github.com/bobbypage/3e65f6a6ad649f082760b9674bd02c5b
@bobbypage I have to tested all configurations yet, but I'm confident that code works on RPi4 and Graviton2. Can you take a look and let me know if this approach is fine with you? I will proceed with further testing then. |
utils/sysfs/sysfs.go
Outdated
return isCPUOnline(cpuPath) | ||
} | ||
|
||
// isCPUOnline verifies if a CPU in online. Takes path to /sys/devices/system/cpu/cpu*/ directory as a parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't we actually checking /sys/bus/cpu/devices/
which is defined as cpuBusPath
above?
by the way what is difference between /sys/bus/cpu/devices/
and /sys/devices/system/cpu/cpu*/
? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are, you are right. /sys/bus/cpu/devices/
consists of symlinks to /sys/devices/system/cpu/cpu*/
, I think.
utils/sysfs/sysfs.go
Outdated
if ok { | ||
if errors.Is(pathErr.Unwrap(), os.ErrNotExist) && isZeroCPU(dir) { | ||
// If error occurred and is os.PathError then it might be os.ErrNotExists. | ||
pathErr, isPathError := err.(*os.PathError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it needed to do type assertion here? maybe can call os.IsNotExist directly, e.g. https://github.com/google/cadvisor/blob/master/machine/operatingsystem_unix.go#L41-L42
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
44709b1
to
66b9b9d
Compare
1133b45
to
1745c0a
Compare
@bobbypage it took a lot of time but now this very PR is ready to re-review. I consider it working. |
8847489
to
3232139
Compare
@bobbypage I have also squashed all the commits into something more manageable :) |
3232139
to
9dedba6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for all your work on this, a few minor comments but overall looks great!
utils/sysfs/sysfs.go
Outdated
func newCPUList(path string) (*cpuList, error) { | ||
online := cpuList{online: map[uint16]interface{}{}} | ||
fileContent, err := ioutil.ReadFile(path) | ||
if err != nil && os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe consider refactoring the check of if hotPlugDisabled to a separate function and handling it in top level IsCPUOnline
as a short circuit rather than having IsCPUOnline
call -> NewCpuList and then cpus.isOnline
in case of hotplug being disabled.
998feb6
to
648523d
Compare
@bobbypage, let me know when you review it - I will squash the commits. |
utils/sysfs/sysfs.go
Outdated
online map[uint16]interface{} | ||
// online maps CPU ID to a struct for each online CPU. | ||
online map[uint16]struct{} | ||
// hotplugDisabled is true when kernel CPU hotplug is disable and all CPUs are assumed online. | ||
hotplugDisabled bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding the hot plug check upfront in IsCPUOnline
. Now that we have that check can we remove this and the duplicate check in newCPUList
?
@bobbypage it would be great if we finished this PR. Kubernetes users area still complaining about problems it fixes. |
LGTM, looks great, please squash and we'll be good to go! |
Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@critical.today>
Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@critical.today>
…ing count of unique properties Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@critical.today>
e69770d
to
fa294cb
Compare
thanks for all your hard work on this!! LGTM |
Fixes #2743 (to some extent). Fixes #2783 too.
According to kernel docs CONFIG_HOTPLUG_CPU is the only way of disabling or enabling cpu hotplug. It should therefore be safe to assume, that if
/sys/devices/system/cpu/cpuXXX/online
file does not exist then hotplug is disabled.The above might not be true on some systems, apparently. Raspberry Pi 4 (Cortex-A72) seems to be one such system.
Besides this:
GetMachineMemoryByType()
andgetUniqueCPUPropertyCount()
have been moved toutils/sysfs
where they belong (these functions read from sysfs).